Data Mining and Visualization - Pre-Pandemic NHANES Examination Data

blogging
jupyter
Python
Back to Basics
Author

Kashish Mukheja

Published

Wednesday, 31 January 2024

A Report by …

@Author: Kashish Mukheja

Introduction:


In this data analysis report, we explore datasets downloaded from the NHANES 2017–March 2020 Pre-Pandemic Examination Data. We downloaded the following five datasets:

  1. P_ACQ.XPT - Pre-Pandemic Questionnaire Acculturation Data
  2. P_AUX.XPT - Pre-Pandemic Examination Audiometry Data
  3. P_DEMO.XPT - Pre-Pandemic Demographics Variables and Sample Weights Data
  4. P_BPXO.XPT - Pre-Pandemic Examination Blood Pressure - Oscillometric Measurement Data
  5. P_BMX.XPT - Pre-Pandemic Examination Body Measures Data

After cleaning the data, we merged the variables from all of the above sources into a single dataset. Our goal is to gain insights into the relationships and patterns present within the data and to create interactive visualizations that facilitate a deeper understanding of the dataset. Throughout the analysis, we pay careful attention to data privacy and ethical considerations, ensuring that our conclusions are drawn responsibly and transparently.

Import Libraries

We’ll start by importing the necessary libraries.

import pandas as pd
import numpy as np

import bokeh.io
from bokeh.io import output_notebook, show
from bokeh.layouts import gridplot, column
from bokeh.models import (ColumnDataSource, HoverTool, CustomJS,
                          Slider, TextInput, DataTable, TableColumn)
from bokeh.plotting import figure
from bokeh.resources import INLINE

Data Preparation and Merging

In this section, we detail the process of preparing and merging data from multiple XPT files into a single dataframe.

Summary:

We began by loading the data from five different XPT files, each representing a distinct dataset: P_ACQ, P_AUX, P_DEMO, P_BPXO, and P_BMX. These files contain information on acculturation, audiometry, demographics, blood pressure examination, and body measures, respectively.

To ensure data quality, we performed two key data preparation steps.

Firstly, we removed columns with more than 25% null values from each dataframe, ensuring that we retained only relevant and informative features.

Secondly, we eliminated any duplicate rows within each dataframe to prevent redundancy and maintain data integrity.

After completing the data preparation steps for each individual dataframe, we merged them based on a common identifier, the “SEQN” column. This facilitated the creation of a comprehensive dataset containing consolidated information from all five datasets.

The resulting merged dataframe, named merged_df, provides a unified view of the data, enabling further analysis and exploration of relationships between different variables across the datasets. This consolidated dataset serves as a valuable resource for gaining insights and making informed decisions in subsequent data analysis tasks.

# Read P_ACQ.XPT into a dataframe
df_acq = pd.read_sas('P_ACQ.XPT', format='xport')

# Read P_AUX.XPT into a dataframe
df_aux = pd.read_sas('P_AUX.XPT', format='xport')

# Read P_DEMO.XPT into a dataframe
df_demo = pd.read_sas('P_DEMO.XPT', format='xport')

# Read P_BPXO.XPT into a dataframe
df_bpxo = pd.read_sas('P_BPXO.XPT', format='xport')

# Read P_BMX.XPT into a dataframe
df_bmx = pd.read_sas('P_BMX.XPT', format='xport')
# List of dataframes
dfs = [df_acq, df_aux, df_demo, df_bpxo, df_bmx]

# Remove columns with > 25% null values and remove duplicated rows
for df in dfs:
    # Remove columns with > 25% null values
    df.dropna(thresh=len(df) * 0.75, axis=1, inplace=True)
    # Remove duplicated rows
    df.drop_duplicates(inplace=True)

# Merge dataframes on "SEQN" column
merged_df = pd.merge(df_acq, df_aux, on='SEQN', how='inner')
merged_df = pd.merge(merged_df, df_demo, on='SEQN', how='inner')
merged_df = pd.merge(merged_df, df_bpxo, on='SEQN', how='inner')
merged_df = pd.merge(merged_df, df_bmx, on='SEQN', how='inner')

Understanding the Data

Before performing any data visualization, it is important to understand the dataframe, specifically what type of data is present. This will help us determine whether any further data cleaning is required.

merged_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4605 entries, 0 to 4604
Data columns (total 82 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   SEQN      4605 non-null   float64
 1   AUAEXSTS  4605 non-null   float64
 2   AUXOTSPL  4276 non-null   float64
 3   AUXLOEXC  4276 non-null   float64
 4   AUXLOIMC  4276 non-null   float64
 5   AUXLOCOL  4276 non-null   float64
 6   AUXLOABN  4276 non-null   float64
 7   AUXROTSP  4276 non-null   float64
 8   AUXROEXC  4276 non-null   float64
 9   AUXROIMC  4276 non-null   float64
 10  AUXROCOL  4276 non-null   float64
 11  AUXROABN  4276 non-null   float64
 12  AUXTMEPR  3813 non-null   float64
 13  AUXTPVR   3985 non-null   float64
 14  AUXTWIDR  3813 non-null   float64
 15  AUXTCOMR  3815 non-null   float64
 16  AUXTMEPL  3813 non-null   float64
 17  AUXTPVL   3983 non-null   float64
 18  AUXTWIDL  3815 non-null   float64
 19  AUXTCOML  3818 non-null   float64
 20  AUAEAR    4251 non-null   float64
 21  AUAMODE   4246 non-null   float64
 22  AUAFMANL  4246 non-null   float64
 23  AUAFMANR  4246 non-null   float64
 24  AUXU1K1R  4209 non-null   float64
 25  AUXU500R  4206 non-null   float64
 26  AUXU1K2R  4209 non-null   float64
 27  AUXU2KR   4208 non-null   float64
 28  AUXU3KR   4203 non-null   float64
 29  AUXU4KR   4204 non-null   float64
 30  AUXU6KR   4201 non-null   float64
 31  AUXU8KR   4201 non-null   float64
 32  AUXU1K1L  4201 non-null   float64
 33  AUXU500L  4200 non-null   float64
 34  AUXU1K2L  4201 non-null   float64
 35  AUXU2KL   4200 non-null   float64
 36  AUXU3KL   4200 non-null   float64
 37  AUXU4KL   4200 non-null   float64
 38  AUXU6KL   4200 non-null   float64
 39  AUXU8KL   4199 non-null   float64
 40  AUAREQC   4245 non-null   float64
 41  AUATYMTR  4605 non-null   object 
 42  AUALEQC   4245 non-null   float64
 43  AUATYMTL  4605 non-null   object 
 44  SDDSRVYR  4605 non-null   float64
 45  RIDSTATR  4605 non-null   float64
 46  RIAGENDR  4605 non-null   float64
 47  RIDAGEYR  4605 non-null   float64
 48  RIDRETH1  4605 non-null   float64
 49  RIDRETH3  4605 non-null   float64
 50  RIDEXMON  4605 non-null   float64
 51  DMDBORN4  4605 non-null   float64
 52  SIALANG   4605 non-null   float64
 53  SIAPROXY  4605 non-null   float64
 54  SIAINTRP  4605 non-null   float64
 55  FIALANG   4352 non-null   float64
 56  FIAPROXY  4352 non-null   float64
 57  FIAINTRP  4352 non-null   float64
 58  WTINTPRP  4605 non-null   float64
 59  WTMECPRP  4605 non-null   float64
 60  SDMVPSU   4605 non-null   float64
 61  SDMVSTRA  4605 non-null   float64
 62  INDFMPIR  4037 non-null   float64
 63  BPAOARM   4605 non-null   object 
 64  BPAOCSZ   3996 non-null   float64
 65  BPXOSY1   3985 non-null   float64
 66  BPXODI1   3985 non-null   float64
 67  BPXOSY2   3965 non-null   float64
 68  BPXODI2   3965 non-null   float64
 69  BPXOSY3   3934 non-null   float64
 70  BPXODI3   3934 non-null   float64
 71  BPXOPLS1  3624 non-null   float64
 72  BPXOPLS2  3606 non-null   float64
 73  BPXOPLS3  3578 non-null   float64
 74  BMDSTATS  4605 non-null   float64
 75  BMXWT     4503 non-null   float64
 76  BMXHT     4498 non-null   float64
 77  BMXBMI    4489 non-null   float64
 78  BMXLEG    4298 non-null   float64
 79  BMXARML   4347 non-null   float64
 80  BMXARMC   4345 non-null   float64
 81  BMXWAIST  4295 non-null   float64
dtypes: float64(79), object(3)
memory usage: 2.9+ MB
  • Columns AUATYMTR, AUATYMTL, and BPAOARM have the object dtype; their values are stored as byte strings rather than decoded text.
merged_df['AUATYMTL'].value_counts()
AUATYMTL
b'Type A'     3583
b''            360
b'88888'       252
b'Type AS'     154
b'Type C'       99
b'99999'        85
b'Type B'       36
b'Type AD'      36
Name: count, dtype: int64
merged_df['BPAOARM'].value_counts()
BPAOARM
b'R'    3626
b''      962
b'L'      17
Name: count, dtype: int64
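
The value counts show that these object columns hold raw byte strings (e.g. b'Type A'), which is how pandas’ read_sas returns character variables from XPT files. If plain Python strings are preferred, the byte columns can be decoded up front; below is a minimal sketch of that optional step (not applied in the rest of this report, which works with the raw values directly):

# Decode every byte-string (object) column in place; empty byte strings become empty strings
for col in merged_df.select_dtypes(include='object').columns:
    merged_df[col] = merged_df[col].str.decode('utf-8')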

Identifying Top Correlated Columns for Visualization

In this section, the goal is to identify the top correlated columns within the dataframe merged_df in order to focus on creating visualizations for these columns in the subsequent sections.

from sklearn.preprocessing import LabelEncoder

# Copy the dataframe to avoid modifying the original dataframe
df_encoded = merged_df.copy()

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Encode string columns
for column in df_encoded.columns:
    if df_encoded[column].dtype == 'object':
        df_encoded[column] = label_encoder.fit_transform(df_encoded[column].astype(str))

# Compute correlation matrix
correlation_matrix = df_encoded.corr()

# Get the absolute correlation values for each column
correlation_values = correlation_matrix.abs()

# Get the columns with the highest correlation values
# We exclude the diagonal and duplicate entries by masking them with a boolean mask
# Here, we choose the top 10 correlated columns, you can adjust this number as needed
top_correlated_columns = (correlation_values.mask(~np.tri(*correlation_values.shape, k=-1, dtype=bool))
                          .stack()
                          .sort_values(ascending=False)
                          .head(10)
                          .reset_index())

# Rename columns for clarity
top_correlated_columns.columns = ['Column 1', 'Column 2', 'Correlation Value']

# Display the top correlated columns
print(top_correlated_columns)
   Column 1  Column 2  Correlation Value
0  AUXU1K2R  AUXU1K1R           0.999666
1  AUXU1K2L  AUXU1K1L           0.997321
2  WTMECPRP  WTINTPRP           0.995129
3   AUXU2KR  AUXU1K2R           0.964144
4   AUXU2KR  AUXU1K1R           0.964085
5  AUXROEXC  AUXROTSP           0.959811
6  RIDRETH3  RIDRETH1           0.958959
7  AUXLOEXC  AUXOTSPL           0.957510
8   AUXU2KL  AUXU1K2L           0.955210
9   AUXU2KL  AUXU1K1L           0.952925

Summary:

To begin, the code utilizes the LabelEncoder from the sklearn.preprocessing module to encode string columns within the dataframe. This is essential as string columns cannot be directly used to compute correlations. By encoding them into numerical values, the data becomes suitable for correlation analysis.

A for loop iterates through each column in the dataframe, and if the column’s data type is ‘object’ (indicating it is a string column), it is encoded using the LabelEncoder.

Next, the code computes the correlation matrix for the encoded dataframe, followed by extracting the absolute correlation values for each column pair.

To select the top correlated columns, the code employs a boolean mask to exclude the diagonal and duplicate entries in the correlation matrix. This ensures that each column is only compared with others, and not with itself or duplicates.
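
To make the masking step concrete, here is a small, self-contained illustration (toy numbers, not NHANES data) of how np.tri(..., k=-1) keeps only the strictly lower triangle, so each pair of columns appears exactly once:

import numpy as np
import pandas as pd

# Toy correlation matrix for three made-up columns
toy_corr = pd.DataFrame(
    [[1.0, 0.9, 0.2],
     [0.9, 1.0, 0.4],
     [0.2, 0.4, 1.0]],
    index=['a', 'b', 'c'], columns=['a', 'b', 'c']
)

# np.tri(..., k=-1) is True strictly below the diagonal; masking with its
# negation hides the diagonal and the upper triangle
lower = toy_corr.mask(~np.tri(*toy_corr.shape, k=-1, dtype=bool))
print(lower.stack().sort_values(ascending=False))
# Highest remaining pair is (b, a) at 0.9, then (c, b) at 0.4, then (c, a) at 0.2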

Finally, the top correlated column pairs along with their correlation values are printed to the console. This information will guide the subsequent visualization process, focusing on the most informative relationships within the data. Adjustments can be made to the number of top correlated columns considered for visualization as needed.

Interactive Visualizations from Merged DataFrame

This section creates interactive visualizations using the Bokeh package. The code works with the merged DataFrame merged_df and provides a visual exploration of the selected columns.

# Load BokehJS and direct all Bokeh output to the notebook
bokeh.io.output_notebook(INLINE)
Loading BokehJS ...

# Select only the columns from the top correlated pairs
selected_columns = ['AUXU1K2R', 'AUXU1K1R', 'AUXU1K2L', 'AUXU1K1L', 'WTMECPRP', 'WTINTPRP',
                    'AUXROEXC', 'AUXROTSP', 'RIDRETH3', 'RIDRETH1', 'AUXLOEXC', 'AUXOTSPL']

# Create a copy of the DataFrame to avoid the SettingWithCopyWarning
df_subset = merged_df[selected_columns].copy()

# Drop rows with NaN values from the copied DataFrame
df_subset.dropna(inplace=True)

# Add string versions of the columns used on categorical axes below,
# since Bokeh categorical ranges expect string factors rather than floats
for cat_col in ['AUXROEXC', 'RIDRETH3', 'RIDRETH1']:
    df_subset[cat_col + '_str'] = df_subset[cat_col].astype(str)

# Convert DataFrame to ColumnDataSource
source = ColumnDataSource(df_subset)

# Scatter plot
scatter = figure(title="Scatter Plot", x_axis_label='AUXU1K2R', y_axis_label='AUXU1K1R')
scatter_renderer = scatter.circle('AUXU1K2R', 'AUXU1K1R', size=10, source=source, color='navy', legend_label='Correlation Value')
scatter_hover = HoverTool(tooltips=[("AUXU1K2R", "@AUXU1K2R"), ("AUXU1K1R", "@AUXU1K1R")])
scatter.add_tools(scatter_hover)

# Line plot
line = figure(title="Line Plot", x_axis_label='WTMECPRP', y_axis_label='WTINTPRP')
line_renderer = line.line('WTMECPRP', 'WTINTPRP', source=source, line_width=2, line_color="orange")

# Convert numerical values to strings
x_values = [str(value) for value in df_subset['AUXROEXC'].unique()]

# Bar plot
bar = figure(title="Bar Plot", x_range=x_values, x_axis_label='AUXROEXC', y_axis_label='AUXROTSP')
# Plot the string version of AUXROEXC so the values match the categorical x_range
bar.vbar(x='AUXROEXC_str', top='AUXROTSP', width=0.5, source=source, color='navy', legend_label='AUXROTSP')
bar.xaxis.major_label_orientation = 1.2

# Convert numerical values to strings
x_values = [str(value) for value in df_subset['RIDRETH3'].unique()]
y_values = [str(value) for value in df_subset['RIDRETH1'].unique()]

# Heatmap
heatmap = figure(title="Heatmap", x_range=x_values, y_range=y_values,
                 x_axis_label='RIDRETH3', y_axis_label='RIDRETH1')
# Plot the string versions so the values match the categorical ranges
heatmap.rect(x='RIDRETH3_str', y='RIDRETH1_str', width=1, height=1, source=source,
             fill_color='blue', line_color=None)
heatmap.xaxis.major_label_orientation = 1.2

# Stacked bar chart
stacked_bar = figure(title="Stacked Bar Chart", x_range=x_values, x_axis_label='RIDRETH3', y_axis_label='AUXU1K1R')
stacked_bar.vbar_stack(stackers=selected_columns[:2], x='RIDRETH3_str', width=0.5, color=['blue', 'orange'], source=source,
                       legend_label=['AUXU1K2R', 'AUXU1K1R'])

# Slider for controlling circle size in scatter plot
slider_circle_size = Slider(start=1, end=20, value=10, step=1, title="Circle Size")

# JavaScript callback to update circle size (acts on the glyph renderer captured above)
callback_update_circle_size = CustomJS(args=dict(renderer=scatter_renderer), code="""
    renderer.glyph.size = cb_obj.value;
""")
slider_circle_size.js_on_change('value', callback_update_circle_size)

# Slider for controlling line width in line plot
slider_line_width = Slider(start=1, end=5, value=2, step=0.5, title="Line Width")

# JavaScript callback to update line width (acts on the glyph renderer captured above)
callback_update_line_width = CustomJS(args=dict(renderer=line_renderer), code="""
    renderer.glyph.line_width = cb_obj.value;
""")
slider_line_width.js_on_change('value', callback_update_line_width)

# Text input for filtering bar plot
text_input_filter = TextInput(value='', title='Filter Bar Plot')

# JavaScript callback to filter bar plot
callback_filter_bar = CustomJS(args=dict(source=source, bar=bar), code="""
    var data = source.data;
    var filterValue = cb_obj.value.toLowerCase().trim();  // Convert input value to lowercase and remove leading/trailing whitespace
    // Build an empty output column for every column in the source,
    // so all columns (including derived ones) are preserved after filtering
    var new_data = {};
    for (var key in data) {
        new_data[key] = [];
    }
    
    console.log("Entered value:", cb_obj);
    
    // Choose a column to iterate over (assuming it has the same length as others)
    var columnLength = data['AUXU1K2R'].length;
    
    // Check if all columns have the same length
    for (var key in data) {
        if (data[key].length !== columnLength) {
            console.error("Columns have different lengths.");
            return;
        }
    }
    
    // Iterate over the chosen column length
    for (var i = 0; i < columnLength; i++) {
        // Check if any column contains the filter value (case-insensitive)
        var matchFound = false;
        for (var key in data) {
            if (data[key][i].toString().toLowerCase().includes(filterValue)) {
                matchFound = true;
                break;
            }
        }
        
        // If a match is found, include the corresponding row in the new data
        if (matchFound) {
            for (var key in new_data) {
                new_data[key].push(data[key][i]);
            }
        }
    }
    
    // Update the source data with the filtered data
    source.data = new_data;
    source.change.emit();
""")
text_input_filter.js_on_change('value', callback_filter_bar)

# Data table
columns = [
    TableColumn(field=column, title=column) for column in selected_columns
]
data_table = DataTable(source=source, columns=columns, width=800, height=280)

# Combine plots into a grid layout
plots_grid = gridplot([[scatter, line], [bar, heatmap], [stacked_bar]])

# Combine widgets into a column layout
widgets_column = column(slider_circle_size, slider_line_width, text_input_filter)

# Combine plots grid, widgets column, and data table into a layout
layout = column(plots_grid, widgets_column, data_table)

# Show the result
show(layout)

Summary:

The provided code performs the following tasks:

  1. Notebook Output: Configures Bokeh to render the visualizations inline in the notebook via output_notebook(), rather than writing them to a separate file.

  2. Select Relevant Columns: Defines a list of columns (selected_columns) to be used for creating visualizations. These columns are selected based on their high correlation values.

  3. Subset Data: Creates a copy of merged_df containing only the selected columns (df_subset). It drops any rows with missing values to ensure data integrity.

  4. Convert to ColumnDataSource: Converts the subset DataFrame (df_subset) into a ColumnDataSource object (source), which is a requirement for creating Bokeh plots.

  5. Create Visualizations:

    • Scatter Plot: Plots the relationship between ‘AUXU1K2R’ and ‘AUXU1K1R’ columns.
    • Line Plot: Shows the trend between ‘WTMECPRP’ and ‘WTINTPRP’ columns.
    • Bar Plot: Displays the distribution of ‘AUXROTSP’ values across different levels of ‘AUXROEXC’.
    • Heatmap: Represents the relationship between ‘RIDRETH3’ and ‘RIDRETH1’ by marking which combinations of the two categories occur in the data.
    • Stacked Bar Chart: Illustrates the comparison between ‘AUXU1K2R’ and ‘AUXU1K1R’ across various levels of ‘RIDRETH3’.
    • Data Table: Provides a tabular view of the selected columns for reference.
  6. Combine Plots: Organizes the plots into a grid layout using gridplot, stacks the interactive widgets in a column, and appends the data table to the final layout.

  7. Show the Result: Renders the combined layout in the notebook, enabling interaction with the visualizations.

This section of code facilitates the exploration and analysis of the data by providing interactive visualizations that allow for deeper insights into the relationships and distributions within the dataset.
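
For a standalone report, the same layout could also be written to an HTML file rather than rendered only in the notebook; a minimal sketch using Bokeh’s file output (the file name here is just an example):

from bokeh.io import output_file, save

# Direct Bokeh output to a standalone HTML file (example file name)
output_file("interactive_visualizations.html", title="NHANES Interactive Visualizations")

# Write the layout built above to that file; open it in any browser to interact with the plots
save(layout)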

Insights and Reflections

Insights:

  1. Scatter Plot Analysis: The scatter plot reveals a strong positive correlation between ‘AUXU1K2R’ and ‘AUXU1K1R’. This suggests a linear relationship between the two variables: as one increases, the other tends to increase as well (a quick numerical check follows this list).

  2. Line Plot Observation: The line plot shows a consistent trend between ‘WTMECPRP’ and ‘WTINTPRP’, indicating a stable, near-linear relationship between these two sample-weight variables across participants.

  3. Bar Plot Interpretation: The bar plot demonstrates the distribution of ‘AUXROTSP’ values across different categories of ‘AUXROEXC’. It highlights potential variations in ‘AUXROTSP’ based on the levels of ‘AUXROEXC’, providing insights into potential factors influencing ‘AUXROTSP’.

  4. Heatmap Exploration: The heatmap visualizes the relationship between ‘RIDRETH3’ and ‘RIDRETH1’, showcasing potential patterns or clusters within the data. It allows for the identification of any trends or disparities across different demographic groups.

  5. Stacked Bar Chart Insight: The stacked bar chart provides a comparison of ‘AUXU1K2R’ and ‘AUXU1K1R’ across various levels of ‘RIDRETH3’. It enables the exploration of differences or similarities in the distribution of these variables within different demographic groups.
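
To back the first and third insights with numbers, one could compute a couple of quick summaries directly from df_subset; a minimal sketch using the columns already selected above:

# Pearson correlation behind the scatter plot (insight 1)
r = df_subset['AUXU1K2R'].corr(df_subset['AUXU1K1R'])
print(f"Pearson r for AUXU1K2R vs AUXU1K1R: {r:.3f}")

# Mean AUXROTSP within each AUXROEXC level (insight 3)
print(df_subset.groupby('AUXROEXC')['AUXROTSP'].mean())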

Reflections on Privacy and Ethics:

  1. Data Privacy Concerns: While analyzing and visualizing data, it’s crucial to ensure the protection of individuals’ privacy. Even though the dataset may be anonymized, the insights gained from the analysis could potentially lead to re-identification of individuals, especially in sensitive demographic variables like race or ethnicity (‘RIDRETH3’ and ‘RIDRETH1’). Care must be taken to avoid unintentional disclosure of sensitive information.

  2. Ethical Considerations: Ethical issues may arise when interpreting and communicating the results of data analysis. It’s essential to present findings accurately and responsibly, avoiding any biased interpretations or misrepresentations. Additionally, decisions based on data analysis should prioritize fairness and equity to avoid perpetuating existing inequalities or biases present in the data.

  3. Informed Consent: If the data used for analysis contains personally identifiable information, obtaining informed consent from the individuals involved is crucial. Individuals should be informed about how their data will be used and have the option to opt out if they wish to protect their privacy.

  4. Data Security: Ensuring the security of the data throughout the analysis process is paramount. Measures such as encryption, access controls, and secure data storage should be implemented to prevent unauthorized access or data breaches.

  5. Transparency and Accountability: Transparency in data analysis practices, including documentation of methodologies and assumptions, fosters accountability and trust. Researchers should be transparent about the sources of data, analytical techniques employed, and any limitations or biases present in the data.

In conclusion, while data analysis provides valuable insights, it’s essential to approach it ethically and responsibly, considering the potential privacy implications and ensuring that the rights and dignity of individuals are respected throughout the process.

Conclusion:

In conclusion, this data analysis project has provided valuable insights into the relationships and distributions within the dataset. Through the creation of interactive visualizations, we have uncovered meaningful patterns and trends that contribute to a better understanding of the data. However, it’s important to recognize the ethical implications of data analysis and to prioritize privacy and fairness in our approach. By adhering to ethical guidelines and promoting transparency, we can ensure that our data analysis efforts contribute positively to the field while upholding the rights and dignity of individuals.

Thank You for Reading the Report
